Sparse convolutional neural network for high-resolution skull shape completion and shape super-resolution

Traditional convolutional neural network (CNN) methods rely on dense tensors, which makes them suboptimal for spatially sparse data. In this paper, we propose a CNN model based on sparse tensors for efficient processing of high-resolution shapes represented as binary voxel occupancy grids. In contrast to a dense CNN that takes the entire voxel grid as input, a sparse CNN processes only the non-empty voxels, thus reducing the memory and computation overhead caused by the sparse input data. We evaluate our method on two clinically relevant skull reconstruction tasks: (1) given a defective skull, reconstruct the complete skull (i.e., skull shape completion), and (2) given a coarse skull, reconstruct a high-resolution skull with fine geometric details (i.e., shape super-resolution). Our method outperforms its dense CNN-based counterparts on the skull reconstruction tasks quantitatively and qualitatively, while requiring substantially less memory for training and inference. We observed that, on the 3D skull data, the overall memory consumption of the sparse CNN grows approximately linearly during inference with respect to the image resolution. During training, the memory usage grows clearly more slowly than the image resolution: a ×8 increase in voxel number leads to less than a ×4 increase in memory requirements. Our study demonstrates the effectiveness of using a sparse CNN for skull reconstruction tasks, and our findings can be applied to other spatially sparse problems. We support this with additional experimental results on other sparse medical datasets, such as the aorta and the heart. Project page at https://github.com/Jianningli/SparseCNN.
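To make the idea of processing only the non-empty voxels concrete, the following minimal sketch converts a binary occupancy grid into a list of occupied coordinates and back. It uses plain NumPy and a toy 8 × 8 × 8 grid; the function names `dense_to_sparse` and `sparse_to_dense` are hypothetical and are not taken from the project code, which may represent sparse tensors differently.

```python
import numpy as np


def dense_to_sparse(grid):
    """Return the (N, 3) integer coordinates of the occupied voxels."""
    return np.argwhere(np.asarray(grid) > 0)


def sparse_to_dense(coords, shape):
    """Rebuild a binary occupancy grid from occupied voxel coordinates."""
    grid = np.zeros(shape, dtype=np.uint8)
    grid[tuple(coords.T)] = 1
    return grid


# Toy example (illustrative only): a thin 4 x 4 plate inside an 8^3 grid.
grid = np.zeros((8, 8, 8), dtype=np.uint8)
grid[2:6, 2:6, 2] = 1

coords = dense_to_sparse(grid)
print(coords.shape)  # (16, 3): only 16 of the 512 voxels are stored
```

In this toy case the sparse form stores 16 coordinates instead of 512 voxel values, which is the kind of saving a sparse representation aims for on thin, shell-like shapes.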


Figure B.1 shows the generated implants at resolution 512 × 512 × Z (without any postprocessing).
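According to the caption of Figure B.1, an implant is obtained by taking the difference between the completed skull and the defective skull. A minimal sketch of that step on binary voxel grids is given below. Interpreting the difference as "occupied in the completed skull and empty in the defective skull" is an assumption, and `extract_implant` is a hypothetical helper name rather than a function from the project code.

```python
import numpy as np


def extract_implant(defective, completed):
    """Keep voxels that are occupied in the completed skull but empty in
    the defective skull (assumed interpretation of 'difference')."""
    defective = np.asarray(defective)
    completed = np.asarray(completed)
    return np.logical_and(completed > 0, defective == 0).astype(np.uint8)


# Toy example (illustrative only): a flat "skull" plate with a square hole.
completed = np.zeros((6, 6, 6), dtype=np.uint8)
completed[:, :, 3] = 1
defective = completed.copy()
defective[2:4, 2:4, 3] = 0  # remove a 2 x 2 patch to simulate the defect

implant = extract_implant(defective, completed)
print(int(implant.sum()))  # 4 voxels, exactly the removed patch
```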

C. Additional Experiments on other Spatially Sparse Medical Images
In this section, we provide additional experiments and results of sparse CNN-based super-resolution on other spatially sparse medical images. The dataset used in the experiments was obtained from the SegTHOR challenge (https://competitions.codalab.org/competitions/21145), which addresses the problem of automatic segmentation of organs at risk. The dataset contains 40 CT scans together with the segmentation masks of the heart (green), aorta (yellow), trachea (blue) and esophagus (red), as shown in Figure C.1. The segmentation masks are spatially sparse, with a very low voxel occupancy rate (VOR), as shown in Table C.1. The dataset also contains 20 CT scans without ground truth segmentation masks, which are used for evaluation. The CT scans and the segmentation masks have a resolution of 512 × 512 × Z.
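As an illustration of the sparsity measure mentioned above, the sketch below computes a voxel occupancy rate for a binary mask. It assumes that the VOR is the fraction of non-zero voxels in the grid; the text does not spell out the definition, and `voxel_occupancy_rate` is a hypothetical helper name.

```python
import numpy as np


def voxel_occupancy_rate(mask):
    """Fraction of voxels that are non-zero (assumed definition of VOR)."""
    mask = np.asarray(mask)
    return np.count_nonzero(mask) / mask.size


# Toy example (illustrative only): 8 occupied voxels in a 4 x 4 x 4 grid.
mask = np.zeros((4, 4, 4), dtype=np.uint8)
mask[1:3, 1:3, 1:3] = 1
print(voxel_occupancy_rate(mask))  # 0.125
```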
Workflow: First, we downsampled the images to 128³ and trained a U-Net style dense CNN (1,803,988 trainable parameters) for automatic segmentation of the organs from the CT scans. Second, inference was run on the CT scans in the training and test sets to generate the coarse (128³) segmentation masks. Third, the coarse masks were up-scaled to their original resolution of 512 × 512 × Z via interpolation. Fourth, we used the up-scaled masks and the original ground truth masks from the training set to train a sparse CNN (the same sparse CNN used for skull super-resolution in the main manuscript) for super-resolution. Lastly, we ran inference with the trained sparse CNN on the up-scaled masks from the test set to obtain the final high-resolution segmentation masks for these organs.
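The up-scaling step of the workflow can be sketched as follows. The text only states that the coarse masks are up-scaled "via interpolation", so the choice of nearest-neighbour interpolation here is an assumption, as is the helper name `upscale_nearest`. The example uses a small toy grid rather than the real 128³ to 512 × 512 × Z sizes.

```python
import numpy as np


def upscale_nearest(mask, target_shape):
    """Resize a voxel mask to target_shape with nearest-neighbour lookup.

    Each output index i along an axis reads source index floor(i * s / t),
    where s and t are the source and target sizes of that axis.
    """
    mask = np.asarray(mask)
    index_per_axis = [
        (np.arange(t) * s) // t for s, t in zip(mask.shape, target_shape)
    ]
    return mask[np.ix_(*index_per_axis)]


# Toy example (illustrative only): one occupied voxel in a 4^3 coarse mask,
# up-scaled anisotropically to mimic a target grid whose depth Z differs.
coarse = np.zeros((4, 4, 4), dtype=np.uint8)
coarse[1, 1, 1] = 1
upscaled = upscale_nearest(coarse, (16, 16, 8))
print(upscaled.shape)       # (16, 16, 8)
print(int(upscaled.sum()))  # 32: the voxel becomes a 4 x 4 x 2 block
```

Nearest-neighbour lookup keeps the mask binary, which is why it is used in this sketch; the up-scaled result is blocky, and refining such masks is the role of the sparse CNN super-resolution step described above.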

Figure A.1 shows a visual comparison of shape completion and super-resolution results at the same resolution level.

Figure A.1: Comparison of completed skulls at resolution 256. In each example, the first column shows the skull obtained from shape completion at resolution 256. The second and third columns show the skulls obtained from skull shape super-resolution from resolutions 64 and 128, respectively. The second row shows the colormap of the signed mesh distance between the predictions and the ground truth.

Figure B.1: Implants (second row) obtained by taking the difference between the defective skulls (first row) and the completed skulls at resolution 512. The last row shows the ground truth.

Figure C.1: A CT scan and the ground truth organ segmentation masks of the heart (green), aorta (yellow), trachea (blue) and esophagus (red) from the SegTHOR challenge.

Figures C.2 to C.5 show the qualitative results for the aorta, heart, esophagus and trachea images. It is worth noting that, because the organs in the dataset are even sparser than the skulls (Table C.1), training at the full 512 × 512 × Z resolution for the super-resolution task takes only a moderate amount of GPU memory (Table C.1), while the super-resolution step can substantially improve the quality of the segmentation masks, as can be seen from Figures C.2 to C.5. Figure C.6 shows the combined segmentation masks of the organs produced by the sparse CNN, viewed in 2D and 3D.

Figure C.2: Super-resolution results for the aorta images. From the first to the last row: the coarse aorta mask predictions (128³) from the dense CNN, the up-scaled aorta masks (512 × 512 × Z), and the super-resolution output from the sparse CNN (512 × 512 × Z).

Figure C.3: Super-resolution results for the heart images. From the first to the last row: the coarse heart mask predictions (128³) from the dense CNN, the up-scaled heart masks (512 × 512 × Z), and the super-resolution output from the sparse CNN (512 × 512 × Z).

Figure C.4: Super-resolution results for the esophagus images. From the first to the last row: the coarse esophagus mask predictions (128³) from the dense CNN, the up-scaled esophagus masks (512 × 512 × Z), and the super-resolution output from the sparse CNN (512 × 512 × Z).

Figure C.5: Super-resolution results for the trachea images. From the first to the last row: the coarse trachea mask predictions (128³) from the dense CNN, the up-scaled trachea masks (512 × 512 × Z), and the super-resolution output from the sparse CNN (512 × 512 × Z).

Figure C.6: The segmentation masks of the organs viewed in 2D (second column) and 3D (third column). The first column shows a slice of the CT scan.